Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix missing object.coffee outputs. #375

Merged
merged 2 commits into from
Mar 20, 2015

Conversation

NathanaelA
Copy link
Contributor

I have been working on a "optional" loaded javascript file that will import a pdf into a new pdf for output. Everything works other than pdfkit currently doesn't support outputting a "hex" value. i.e < 0A >, the Object.convert wants to either treat it as a String key i.e. /0A or as a Object << 0A >>

@NathanaelA
Copy link
Contributor Author

I have finished a module that can be integrated with pdfkit that will add a new function to the pdfkit prototype that will allow us to import multiple PDF into the current pdf stream. The code is now working on my box; but the only piece that is missing so that people don't have to pull "my" copy of pdfkit is the ability for me to output a value the way I need to to the stream. (In the only case I haven't been able to work around in the pdfkit) So I would be appreciate if we could get this merged in and deployed as soon as you have some time.

@NathanaelA
Copy link
Contributor Author

I should also say that as I was testing I realized just now that "true", "false", "null" and "" all also have to be passed through the system as a isRaw as they are string values but not a "/key" and not a "(string)" but actually the raw text. So I actually have a couple things now using that minor change to your objects.convert function.

@devongovett
Copy link
Member

Boolean values, null, etc. are already handled by the else case here which converts them to strings correctly. I'm curious what other "hex" values you need to import? Do you have an example?

@NathanaelA
Copy link
Contributor Author

I'll double check that true/false/null gets passed through, I left them as strings as they came in as string characters; but switching them to a real value is easy. However the "" (Which can be considered as blank or null depending on the key) and hex values (on PDF 1.7 Spec page 56) are still handled with the isRaw.

As for the usage; in the 100 random pdf's that I'm playing with I've seen hex values primarily in the ColorSpace key, but I've also seen it anywhere MultiByte strings are supposed to be. And their are a couple keys in the spec that are required to be Hex (For example page 185&186, SubType)

@NathanaelA
Copy link
Contributor Author

I've confirmed with some minor alterations that null/true/false will work fine, now I just need Hex and blank. I should probably state; when I import the pdf; I read the entire structure and all keys, strings, dictionary's, etc; then I spit out the all the data into pdfkit structures. This allows me to use pdf kit to create a page; import pages; create more pages; import pages; and then create more pages...

I don't modify any data other than all the references are rewritten (using your pdfreference) to be updated to stay in sync in the new document. So if I get passed in a Hex value; it gets spit back out as a hex value. The idea is to do as few changes to the stream as possible, this allows most the features of the existing pdf to continue to work even without me directly supporting them.

@devongovett
Copy link
Member

I'm already not really a fan of the isString hack here, and I'd rather not add another one. For hex strings, we could use Buffers (e.g. convert them to "<#{buf.toString('hex')}>"). Empty strings can be either () or <> (empty PDFObject.s i.e. with isString, or empty Buffer). Does that work for you?

@NathanaelA
Copy link
Contributor Author

I'm not a big fan of the isString / isRaw hacks either, & the only reason I went with isRaw (rather than .isHex) is it allowed me to standardize how to do different outputs depending on the object passed in; a "" gets output as "", not as ("") or /"" and hex gets output as . So creating new "objects" to output anything was simple. However, updating the code to support the Buffer object is also a good solution.

Since I really need this support; I am willing to do this which ever way you want:

  1. Leave it as is, two hacks (isString/isRaw).
  2. Update the code and refactor the isString; and change it to isRaw (so you only have one hack in the system still).
  3. Eliminate the isString in the system and my new isRaw, and replace it with code that supports doing Buffer.isBuffer & Buffer.toString() -- I assume that Buffer is emulated property on the browser as I believe you are using it elsewhere in your code.

@devongovett
Copy link
Member

Let's go with the buffer option to create hex strings. Browserify automatically handles buffer support in the browser.

As for isString, I think a better implementation would be to have a wrapper class, e.g. PDFString for that so we can do if object instanceof PDFString, but that doesn't necessarily need to be done in this PR. I went with name objects (e.g. /name) for strings by default since they are more commonly used, so there has to be a wrapper for strings somehow to distinguish them.

Where are you seeing empty strings without parentheses or angle brackets around them? I don't see anything in the spec for that... they say empty strings should be enclosed with parens.

… replace with the Buffer object so that anything can be supported.
@NathanaelA
Copy link
Contributor Author

I've updated the code to eliminate the .isRaw/.isString hacks; so they all now use Buffer objects. I've tested my code against it and everything appears to be working properly.

As for the Empty string/value; I think I saw it somewhere in the spec that it could be considered null (empty) in some cases; but the main thing is that I've ran into several pdf's that the code was:
"x 0 obj endobj" and then either a dictionary key reference or a reference in a array to that obj; so I am making the "assumption" that if it is in the wild; it needs to be kept as is. I don't want to arbitrarily delete a "empty" obj, when it has refs'. So I need to be able to create "x 0 obj endobj" which means passing a "" as the value to be stored.

@NathanaelA
Copy link
Contributor Author

Any news on if this will be accepted (again I'm willing to do whatever changes are needed); or do I need to point my repository to my own copy of PDFKit so that I can have a automatic deploy of both my pdfimport and pdfkit.

@devongovett devongovett merged commit cba6f80 into foliojs:master Mar 20, 2015
@devongovett
Copy link
Member

Sorry, there were merge conflicts here. I spent some time reworking this in 22a9bfd. I realized that JavaScript already has two string representations: literal and String object wrapper, and we can take advantage of this. Now, literals are converted to PDF names as before, and String objects are converted to PDF strings. Buffers are converted to hex strings.

What this means for you is that you'll need to parse the hex strings you get in as buffers, taking off the angle brackets: e.g. new Buffer(string.slice(1, -1), 'hex'). You'll also need to wrap strings as String objects, removing the parens: new String(string.slice(1, -1)).

I found the bit about null objects in the spec. From my reading, I think it means references that don't point to a valid object become null:

An indirect reference to an undefined object is not an error; it is simply treated as a reference to the null object. For example, if a file contains the indirect reference

    17 0 R

but does not contain the corresponding definition

    17 0 obj
      ...
    endobj

then the indirect reference is considered to refer to the null object.

I haven't seen anything about the contents of the referenced definition being empty. Have you tried just dropping them? Or how about putting null as the contents instead?

@NathanaelA
Copy link
Contributor Author

Only problem is now with your '<' + object.toString('hex') + '>' is that you are forcing buffer to be a hex. I would much much rather have this be just a straight: "object.toString();" so that I can control what gets output.

The reasons why is:

  1. I already read it in as a a raw string (that is how it is in the document); it is already in hex form; I don't want to convert it out of hex to normal characters so it can be re-converted back hex. So very wasteful of cpu and memory -- I would much rather just do new Buffer('<'.+ data + '>'); So what is read in is written back out as is; no conversions.
  2. By leaving it object.toString() I can also support other things like the blank issue "new Buffer('')" so that I get a "x 0 obj endobj" out. I have no idea what happens if I drop the ref, most pdf readers are pretty forgiving; but occasionally if you put the data in wrong they crash (took me a while to figure out what was causing my pdf reader to crash; and it ended up actually being a bad ref). What I do know is that that format is in a couple pdf's -- so I assume (maybe a bad assumption) that it is valid even though I didn't see it specifically documented in the pdf spec.

So any change I can get you to change it back to what my pull was of just the
else if Buffer.isBuffer(object)
object.toString()

@devongovett
Copy link
Member

The reason I did this was that making hex strings out of binary buffers could be useful for other purposes in the future, other than just what you're doing here. You already have to make a new Buffer either way, and the memory allocation is the expensive part, not the string conversion, which takes place both ways as well. Your way you have to make a buffer from the ascii string, and this way you have to make one from a hex string - probably no difference in terms of performance, and it will use less memory (2 characters in the buffer for every hex byte, vs 1 byte).

The spec seems clear enough to me:

The definition of an indirect object in a PDF file consists of its object number and generation number, followed by the value of the object itself bracketed between the keywords obj and endobj.

Containing nothing is invalid, pdf readers just happen to accept it. Can you try setting the value to null instead?

@NathanaelA
Copy link
Contributor Author

I switched it to null, and replaced my code to use your new String and Buffer methods and I am good to go. Thanks for getting these changes in! Any idea when npm will have it?

@devongovett
Copy link
Member

Sorry about the wait. Just released v0.7.1. Was trying to get together a bigger release, but it's taking longer than I hoped.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants